which is defined as
\[
H(Q_a(x)) = -\sum_{q_x} p(q_x)\log p(q_x) = \frac{1}{2}\log 2\pi e\sigma_x^2,
\qquad
\max H(Q_a(x)) = \frac{n\ln 2}{2^n}, \ \text{when } p(q_x) = \frac{1}{2^n},
\tag{2.19}
\]
where $q_x$ denotes the random quantized variables in $Q_a(x)$ (which stands for $Q_a(q)$ or $Q_a(k)$, depending on the input) with probability mass function $p(\cdot)$. The information entropy should be maximized during quantization so that the quantized MHSA modules retain the information contained in their full-precision counterparts.
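To make the quantity in Eq. (2.19) concrete, the sketch below estimates the information entropy of an already-quantized tensor from the empirical frequencies of its discrete levels; the helper name is hypothetical and not part of the original method, and a more uniform use of the quantization levels simply yields a higher entropy.

```python
import torch

def quantized_entropy(x_q: torch.Tensor) -> float:
    """Estimate H(Q_a(x)) = -sum_{q_x} p(q_x) log p(q_x) from the empirical
    frequency of each discrete level in an already-quantized tensor x_q."""
    _, counts = torch.unique(x_q, return_counts=True)
    p = counts.float() / counts.sum()        # empirical probability mass p(q_x)
    return float(-(p * p.log()).sum())       # entropy in nats
```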
However, directly applying a quantization function that maps values onto a finite set of fixed points introduces an irreversible disturbance to the distributions, and the information entropies $H(Q_a(q))$ and $H(Q_a(k))$ degenerate to a much lower level than those of their full-precision counterparts. To mitigate this information degradation in the quantized attention mechanism, an Information Rectification Module (IRM) is proposed to effectively maximize the information entropy of the quantized attention weights:
\[
Q_a(\tilde{q}) = Q_a\!\left(\gamma_q\,\frac{q-\mu(q)}{\sqrt{\sigma^2(q)+\epsilon_q}} + \beta_q\right),
\qquad
Q_a(\tilde{k}) = Q_a\!\left(\gamma_k\,\frac{k-\mu(k)}{\sqrt{\sigma^2(k)+\epsilon_k}} + \beta_k\right),
\tag{2.20}
\]
where $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ are learnable parameters that reshape the distributions of $\tilde{q}$ and $\tilde{k}$, while $\epsilon_q$ and $\epsilon_k$ are small constants that keep the denominators from being zero. The learnable $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ are trained with the same learning rate as the rest of the network. Thus, after IRM, the information entropies $H(Q_a(\tilde{q}))$ and $H(Q_a(\tilde{k}))$ are formulated as
\[
H(Q_a(\tilde{q})) = \frac{1}{2}\log 2\pi e\!\left[\gamma_q^2(\sigma_q^2+\epsilon_q)\right],
\qquad
H(Q_a(\tilde{k})) = \frac{1}{2}\log 2\pi e\!\left[\gamma_k^2(\sigma_k^2+\epsilon_k)\right].
\tag{2.21}
\]
Then, the learnable parameters $\gamma_q, \beta_q$ and $\gamma_k, \beta_k$ reshape the distributions of the query and key values toward the state of maximum information entropy, reviving the ability of the attention mechanism to capture critical elements. In a nutshell, in our IRM-Attention structure, the information entropy of the quantized attention weights is maximized to alleviate their severe information distortion and revive the attention mechanism.
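The following PyTorch sketch illustrates one way the rectification of Eq. (2.20) could be realized in code. It assumes that $\mu$ and $\sigma^2$ are computed per token over the channel dimension and that `quantizer` stands for the activation quantizer $Q_a$; the class name, argument names, and normalization axis are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class InformationRectification(nn.Module):
    """Sketch of the IRM in Eq. (2.20): a learnable re-normalization of the
    query/key distributions applied before activation quantization, so that
    gamma and beta can be trained to maximize the entropy in Eq. (2.21)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # gamma_q or gamma_k
        self.beta = nn.Parameter(torch.zeros(dim))   # beta_q or beta_k
        self.eps = eps                               # epsilon_q or epsilon_k

    def forward(self, x: torch.Tensor, quantizer) -> torch.Tensor:
        # x: (batch, num_patches, dim) query or key activations
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_tilde = self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta
        return quantizer(x_tilde)                    # Q_a(q~) or Q_a(k~)
```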
2.3.4 Distribution-Guided Distillation Through Attention
To address the attention distribution mismatch that occurs in the fully quantized ViT baseline during backward propagation, we further propose a distribution-guided distillation (DGD) scheme with appropriate distilled activations and well-designed similarity matrices, which exploits the teacher's knowledge effectively and optimizes the fully quantized ViT more accurately.
As an optimization technique based on element-wise comparison of activations, distillation allows the quantized ViT to mimic the output logits of the full-precision teacher model. However, we find that the distillation procedure used in previous ViTs and in the fully quantized ViT baseline (Section 2.3.1) cannot provide fine-grained supervision of the attention weights (shown in Fig. 2.6), leading to insufficient optimization. To address this insufficient optimization in distilling the fully quantized ViT, we propose the Distribution-Guided Distillation (DGD) method in Q-ViT. Following [226], we first build patch-based similarity pattern matrices from the upstream query and key, instead of from the attention weights, which is formulated as
\[
\begin{aligned}
\tilde{G}^{l}_{q_h} &= \tilde{q}^{l}_{h}\cdot(\tilde{q}^{l}_{h})^{\top}, &\quad G^{l}_{q_h} &= \tilde{G}^{l}_{q_h}\,/\,\|\tilde{G}^{l}_{q_h}\|_2,\\
\tilde{G}^{l}_{k_h} &= \tilde{k}^{l}_{h}\cdot(\tilde{k}^{l}_{h})^{\top}, &\quad G^{l}_{k_h} &= \tilde{G}^{l}_{k_h}\,/\,\|\tilde{G}^{l}_{k_h}\|_2,
\end{aligned}
\tag{2.22}
\]
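As a rough illustration of Eq. (2.22), the sketch below builds the normalized similarity matrix for a single head in a single layer; the function name is hypothetical, and interpreting $\|\cdot\|_2$ as the Frobenius norm of the matrix is an assumption.

```python
import torch

def similarity_pattern(x: torch.Tensor) -> torch.Tensor:
    """Patch-based similarity matrix for one head: x has shape
    (num_patches, head_dim); returns G = (x x^T) / ||x x^T||_2."""
    g = x @ x.transpose(-2, -1)    # G~ = x x^T, shape (num_patches, num_patches)
    return g / g.norm(p=2)         # normalize (here: Frobenius norm of G~)
```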